Creating a Feasible Corpus for Persian POS Tagging

نویسندگان

  • Farhad Oroumchian
  • Samira Tasharofi
  • Hadi Amiri
  • Hossein Hojjat
  • Fahimeh Raja
چکیده

This paper describes creation of a test collection for Persian Part of Speech Tagging experiments. This collection was created by modifying a manually Part of Speech (POS) tagged Persian corpus with over two million tagged words. The original collection had a tag set of 550 tags that are more than what any machine learning algorithm can handle. The number of tags for these experiments was reduced to only 40 tags. Since the main motivation for creation of this collection is building and testing efficient Persian Part of Speech taggers, a benchmark tagger was created for that used Maximum Likelihood Estimation (MLE) for guessing the correct tags in Persian. Two different versions this tagger were produced that differed on handling “Unseen words”. One of them tagged the unseen words with a “DEFAULT” tag and in the other one with “N-SING” which was the most frequent tag used in the corpus. The best accuracy that was achieved by MLE tagging was (95.43%). The results of the experiments are encouraging as they are comparable to other languages such as English German and Spanish. This test corpus will be distributed among other researchers to be used as a base for Persian natural language processing.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

متن کامل

Unsupervised Part of Speech Tagging for Persian

In this paper we present a rather novel unsupervised method for part of speech (below POS) disambiguation which has been applied to Persian. This method known as Iterative Improved Feedback (IIF) Model, which is a heuristic one, uses only a raw corpus of Persian as well as all possible tags for every word in that corpus as input. During the process of tagging, the algorithm passes through sever...

متن کامل

سیستم برچسب گذاری اجزای واژگانی کلام در زبان فارسی

Abstract: Part-Of-Speech (POS) tagging is essential work for many models and methods in other areas in natural language processing such as machine translation, spell checker, text-to-speech, automatic speech recognition, etc. So far, high accurate POS taggers have been created in many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...

متن کامل

A Persian Part-Of-Speech Tagger Based on Morphological Analysis

This paper describes a method based on morphological analysis of words for a Persian Part-Of-Speech (POS) tagging system. This is a main part of a process for expanding a large Persian corpus called Peyekare (or Textual Corpus of Persian Language). Peykare is arranged into two parts: annotated and unannotated parts. We use the annotated part in order to create an automatic morphological analyze...

متن کامل

A Hybrid Morphology-Based POS Tagger for Persian

In many applications of natural language processing (NLP) grammatically tagged corpora are needed. Thus Part of Speech (POS) Tagging is of high importance in the domain of NLP. Many taggers are designed with different approaches to reach high performance and accuracy. These taggers usually deal with inter-word relations and they make use of lexicons. In this paper we present a new tagging algor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007